Abstract

We study the properties of the MDL (or max- 
imum penalized complexity) estimator for 
Regression and Classification, where the un- 
derlying model class is countable. We show 
in particular a finite bound on the Hellinger 
losses under the only assumption that there is 
a "true" model contained in the class. This 
implies almost sure convergence of the pre- 
dictive distribution to the true one at a fast 
rate. It corresponds to Solomonoff's central 
theorem of universal induction, however with 
a bound that is exponentially larger. 

Keywords. Regression, Classification, Se- 
quence Prediction, Machine Learning, Min- 
imum Description Length, Bayes Mix- 
ture, Marginalization, Convergence, Discrete 
Model Classes. 



1. Introduction 

Bayesian methods are popular in Machine Learning. 
So it is natural to study their predictive properties: 
How do they behave asymptotically for increasing sam- 
ple size? Are loss bounds obtainable, either for cer- 
tain interesting loss functions or even for more general 
classes of loss functions? 

In this paper, we consider what are perhaps the two most impor- 
tant Bayesian methods for prediction in the context of 
regression and classification. The first one is marginal- 
ization: Given some data and a model class, obtain a 
predictive model by integrating over the model class. 
This Bayes mixture is "ideal" Bayesian prediction in 
many respects, however in many cases it is computa- 
tionally intractable. Therefore, a commonly employed 
method is to compute a maximum penalized complexity 
or maximum a posteriori (MAP) or minimum descrip- 
tion length (MDL) estimator. This predicts according 
to the "best" model instead of a mixture. The MDL 



principle is important for its own sake, not only as 
approximation of the Bayes mixture. 

Most work on Bayesian prediction has been carried out for continuous model classes, e.g. classes with one free parameter $\vartheta \in \mathbb{R}^d$. While the predictive properties of the Bayes mixture are excellent under mild conditions [CB90, Hut03b, GGvdV00, Hut04], corresponding MAP or MDL results are more difficult to establish. For MDL in the strong sense of description length, the parameter space has to be discretized appropriately (and dynamically with increasing sample size) [Ris96, BdW98, BC91]. A MAP estimator on the other hand can be very bad in general. In the statistical literature, some important work has been performed on the asymptotic discovery of the true parameter, e.g. [CY00]. This can only hold if each model occurs no more than once in the class. Thus it is violated e.g. in the case of an artificial neural network, where exchanging two hidden units in the same layer does not alter the network behavior. 

In the case of discrete model classes, both loss bounds and asymptotic assertions for the Bayes mixture are relatively easy to prove, compare Theorem 2. In [PH04a], corresponding results for MDL were shown. The setting is sequence prediction but otherwise very general. The only assumption necessary is that the true distribution is contained in the model class. Assertions are given directly for the predictions, thus there is no problem of possibly indistinguishable models. In order to prove that the MDL estimator (precisely, the static MDL estimator in terms of [PH04a]) has good predictive properties, we introduce an intermediate step and show first the predictive properties of dynamic MDL, where a new MDL estimator is computed for each possible next observation. 

In this paper, we will derive analogous results for re- 
gression and classification. While results for classi- 
fication can be generalized from sequence prediction 
by conditionalizing everything to the input, regres- 



sion is technically more difficult. Therefore the next 
section, which deals with the regression setup, covers 
the major part of the paper. Instead of the popular 
Euclidean and Kullback-Leibler distances for measur- 
ing prediction quality we need to exploit the Hellinger 
distance. We show that online MDL converges to the 
true distribution in mean Hellinger sum, which implies 
"rapid" convergence with probability one. Classification is briefly discussed in Section 3, followed by a discussion and conclusions in Section 4. 

2. Regression 

We neglect computational aspects and study the prop- 
erties of the optimal Bayes mixture and MDL predic- 
tors. When a new sample is observed, the estimator is 
updated. Thus, regression is considered in an online 
framework: The first input $x_1$ is presented, we predict the output $y_1$ and then observe its true value, the second input $x_2$ is presented and so on. 

Setup. Consider a regression problem with arbitrary domain $\mathcal{X}$ (we need no structural assumptions at all on $\mathcal{X}$) and co-domain $\mathcal{Y} = \mathbb{R}$. The task is to learn/fit/infer a function $f : \mathcal{X} \to \mathcal{Y}$, or more generally a conditional probability density $\nu(y|x)$, from data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. Formally, we are given a countable class $\mathcal{C}$ of models that are functions $\nu$ from $\mathcal{X}$ to uniformly bounded probability densities on $\mathbb{R}$. That is, $\mathcal{C} = \{\nu_i : i \ge 1\}$, and there is some $C > 0$ such that 

$$0 \le \nu_i(y|x) \le C \quad\text{and}\quad \int_{-\infty}^{\infty} \nu_i(y|x)\,dy = 1 \tag{1}$$

for all $i \ge 1$, $x \in \mathcal{X}$, and $y \in \mathcal{Y}$. 

Each $\nu$ induces a probability density on $\mathbb{R}^n$ for n-tuples $x_{1:n} \in \mathcal{X}^n$ by $\nu(y_{1:n}|x_{1:n}) = \prod_{t=1}^{n} \nu(y_t|x_t)$. The notation $x_{1:n}$ for n-tuples is common in sequence prediction. Each model $\nu \in \mathcal{C}$ is associated with a prior weight $w_\nu > 0$. The logarithm $\log_2 w_\nu^{-1}$ often has an interpretation as model complexity. We require $\sum_\nu w_\nu = 1$. Then by the Kraft inequality, one can assign to each model $\nu \in \mathcal{C}$ a prefix-code of length $\lceil \log_2 w_\nu^{-1} \rceil$. 

We assume that an infinite stream of data $(x_{1:\infty}, y_{1:\infty})$ is generated as follows: Each $x_t$ may be produced by an arbitrary mechanism, while $y_t$ is sampled from a true distribution $\mu$ conditioned on $x_t$. In order to obtain strong convergence results, we will require that $\mu \in \mathcal{C}$. 

Example 1 Take $\mathcal{X} = \mathbb{R}$ and $\mathcal{C}^{\mathrm{lin}}_{\sigma} = \{ax + b + N(0, \sigma^2) : a, b \in \mathbb{Q}\}$ to be the class of linear regression models with rational coefficients $a$, $b$, and independent Gaussian noise of fixed variance $\sigma^2 > 0$. That is, $\mathcal{C}^{\mathrm{lin}}_{\sigma} = \{\nu^{a,b,\sigma} : a, b \in \mathbb{Q}\}$, where 

$$\nu^{a,b,\sigma}(x, y) = \phi_{\sigma^2}(y - ax - b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(y - ax - b)^2}{2\sigma^2}\Big).$$

Alternatively, you may consider the class $\mathcal{C}^{\mathrm{lin}}_{\ge\sigma_0} = \{\nu^{a,b,\sigma} : a, b, \sigma \in \mathbb{Q}, \sigma \ge \sigma_0\}$ for some $\sigma_0 > 0$, where also the noise amplitude is part of the models. In the following, we also discuss how to admit degenerate Gaussians that are point measures such as $\mathcal{C}^{\mathrm{lin}}_{\ge 0}$. 
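
To make the setup concrete, the following is a minimal sketch of a finite truncation of the linear Gaussian class from Example 1, together with the prefix-code lengths $\lceil \log_2 w_\nu^{-1} \rceil$ guaranteed by the Kraft inequality. The uniform prior, the coefficient grid, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math
from fractions import Fraction

def gaussian_density(y, mean, sigma):
    """phi_{sigma^2}(y - mean): Gaussian density with standard deviation sigma."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def make_linear_class(coefficients, sigma):
    """Finite truncation of the linear class: one model per pair (a, b).

    The uniform prior weight is a hypothetical choice made for illustration.
    """
    pairs = [(a, b) for a in coefficients for b in coefficients]
    weight = 1.0 / len(pairs)
    return [{"a": a, "b": b, "sigma": sigma, "w": weight} for a, b in pairs]

def density(model, y, x):
    """nu(y|x) for a linear model with Gaussian noise."""
    mean = float(model["a"]) * x + float(model["b"])
    return gaussian_density(y, mean, model["sigma"])

def code_length(w):
    """Prefix-code length ceil(log2(1/w)) in bits."""
    return math.ceil(math.log2(1.0 / w))

# Rational coefficients -1, -1/2, 0, 1/2, 1 for both slope and intercept.
coeffs = [Fraction(k, 2) for k in range(-2, 3)]
models = make_linear_class(coeffs, sigma=1.0)
lengths = [code_length(m["w"]) for m in models]
kraft_sum = sum(2.0 ** -length for length in lengths)
```

With a fixed noise level the densities are bounded by $1/\sqrt{2\pi\sigma^2}$, so this truncated class satisfies condition (1); `kraft_sum` stays at or below 1, which is what allows the code lengths to be realised by a prefix code.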

The setup (1) guarantees that all subsequent MDL estimators and (10) exist. However, our results and 
proofs generalize in several directions. First, for the co-domain $\mathcal{Y}$ we may choose any $\sigma$-finite measure space instead of $\mathbb{R}$, since we need only Radon-Nikodym densities below. Second, the uniform boundedness condition can be relaxed, if the MDL estimators still exist. This holds for example for the class $\mathcal{C}^{\mathrm{lin}}_{\ge 0}$ (see the preceding example), if the definition of the MDL estimators is adapted appropriately (see footnote 2). Third, the results remain valid for semimeasures with $\int \nu \le 1$ instead of measures and $\sum_\nu w_\nu \le 1$, which is however not very relevant for regression (but for universal sequence prediction). In order to keep things simple, we develop all results on the basis of (1). Note finally that the models in $\mathcal{C}$ may be time-dependent, and we need not even make this explicit, since the time can be incorporated into $\mathcal{X}$ ($x_t = (x'_t, t) \in \mathcal{X}' \times \mathbb{N} = \mathcal{X}$). In this way we may also make the models depend on the actual past outcome, if this is desired ($x_t = (x'_t, y_{1:t-1}) \in \mathcal{X}' \times \mathcal{Y}^* = \mathcal{X}$). 

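The case of independent Gaussian noise as in Example 1 is a particularly important one. We therefore introduce the family 

$$\mathcal{F}^{\mathrm{Gauss}}_{\ge\sigma_0} = \Big\{ \mathcal{C} = \{\nu_i, \sigma_i\}_{i=1}^{\infty} : \nu_i(x, y) = \phi_{\sigma_i^2}\big(y - f_i(x)\big),\ \sigma_i \ge \sigma_0 > 0,\ f_i : \mathcal{X} \to \mathbb{R} \Big\} \tag{2}$$

of all countable regression model classes with lower bounded Gaussian noise. Clearly, $\mathcal{C}^{\mathrm{lin}}_{\sigma}, \mathcal{C}^{\mathrm{lin}}_{\ge\sigma_0} \in \mathcal{F}^{\mathrm{Gauss}}_{\ge\sigma_0}$ is satisfied. Similarly, $\mathcal{F}^{\mathrm{Gauss}} \supset \mathcal{F}^{\mathrm{Gauss}}_{\ge\sigma_0}$ denotes the corresponding family without lower bound on $\sigma_i$. Then $\mathcal{C}^{\mathrm{lin}}_{\ge 0} \in \mathcal{F}^{\mathrm{Gauss}} \setminus \mathcal{F}^{\mathrm{Gauss}}_{\ge\sigma_0}$. 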

We define the Bayes mixture, which for each $n \ge 1$ maps an n-tuple of inputs $x_{1:n} \in \mathcal{X}^n$ to a probability density on $\mathbb{R}^n$: 

$$\xi(y_{1:n}|x_{1:n}) = \sum_{\nu \in \mathcal{C}} w_\nu\, \nu(y_{1:n}|x_{1:n}) = \sum_{\nu \in \mathcal{C}} w_\nu \prod_{t=1}^{n} \nu(y_t|x_t) \tag{3}$$

(recall $\sum_\nu w_\nu = 1$). Hence, the Bayes mixture dominates each $\nu$ by means of $\xi(\cdot|x_{1:n}) \ge w_\nu\, \nu(\cdot|x_{1:n})$ for all $x_{1:n}$. For $\nu \in \mathcal{C}$ and $x_n \in \mathcal{X}$, the $\nu$-prediction of $y_n \in \mathbb{R}$, that is the $\nu$-probability density of observing $y_n$, is 

$$\nu(y_n|x_{1:n}, y_{<n}) = \nu(y_n|x_n).$$

This is independent of the history $(x_{<n}, y_{<n}) = (x_{1:n-1}, y_{1:n-1})$. In contrast, the Bayes mixture prediction or regression, which is also a measure on $\mathbb{R}$, depends on the history: 

$$\xi(y_n|x_{1:n}, y_{<n}) = \frac{\xi(y_{1:n}|x_{1:n})}{\xi(y_{<n}|x_{<n})} = \frac{\sum_\nu w_\nu \prod_{t=1}^{n} \nu(y_t|x_t)}{\sum_\nu w_\nu \prod_{t=1}^{n-1} \nu(y_t|x_t)}. \tag{4}$$

This is also known as marginalization. Observe that the denominator in (4) vanishes only on a set of $\mu$-measure zero, if the true distribution $\mu$ is contained in $\mathcal{C}$. Under condition (1), the Bayes mixture prediction is uniformly bounded. It can be argued intuitively that in case of unknown $\mu \in \mathcal{C}$ the Bayes mixture is the best possible model for $\mu$. Formally, its predictive properties are excellent: 

Theorem 2 Let $\mu \in \mathcal{C}$, $n \ge 1$, and $x_{1:n} \in \mathcal{X}^n$, then 

$$\sum_{t=1}^{n} \mathbf{E} \int \Big( \sqrt{\mu(y_t|x_t)} - \sqrt{\xi(y_t|x_{1:t}, y_{<t})} \Big)^2 dy_t \le \ln w_\mu^{-1}. \tag{5}$$

$\mathbf{E}$ denotes the expectation with respect to the true distribution $\mu$. Hence in this case we have $\mathbf{E} \ldots = \int \ldots \mu(dy_{<t})$. The integral expression is also known as the squared Hellinger distance. It will emerge as a main tool in the subsequent proofs. So the theorem states that on any input sequence $x_{<\infty}$ the expected cumulative Hellinger divergence of $\mu$ and the Bayes mixture prediction is bounded by $\ln w_\mu^{-1}$. A closely related result was discovered by Solomonoff [Sol78] for universal sequence prediction; a "modern" proof can be found in [Hut04]. This proof can be adapted to our regression framework. Alternatively, it is not difficult to give a proof in a few lines analogous to (14) and (15) by using (12). 
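
The marginalization step (4) can be sketched for a finite class as follows. This is an illustrative sketch, not the authors' implementation: the three models, their weights, and the data are made-up assumptions, and a model is represented as a hypothetical tuple `(f, sigma, w)` of mean function, noise level, and prior weight.

```python
import math

def gaussian_density(y, mean, sigma):
    """Gaussian density with standard deviation sigma, evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior_weights(models, xs, ys):
    """w_nu * prod_t nu(y_t|x_t) for each model, normalised to sum to one."""
    joint = []
    for f, sigma, w in models:
        likelihood = 1.0
        for x, y in zip(xs, ys):
            likelihood *= gaussian_density(y, f(x), sigma)
        joint.append(w * likelihood)
    total = sum(joint)  # this is xi(y_{<n}|x_{<n}), the denominator in (4)
    return [j / total for j in joint]

def mixture_prediction(models, xs, ys, x_new, y_new):
    """xi(y_n|x_{1:n}, y_{<n}) from (4), as a posterior-weighted average."""
    post = posterior_weights(models, xs, ys)
    return sum(p * gaussian_density(y_new, f(x_new), sigma)
               for p, (f, sigma, _) in zip(post, models))

# Hypothetical class: three lines through the origin, unit noise.
models = [
    (lambda x: 0.0 * x, 1.0, 0.5),
    (lambda x: 1.0 * x, 1.0, 0.25),
    (lambda x: -1.0 * x, 1.0, 0.25),
]
xs = [1.0, 2.0, 3.0]
ys = [1.1, 1.9, 3.2]
post = posterior_weights(models, xs, ys)
```

Writing the ratio in (4) as a posterior-weighted average is algebraically the same quantity; the history enters only through the posterior weights, which is why the mixture prediction depends on the past while each single model's prediction does not.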

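We introduce the term convergence in mean Hellinger sum (i.m.H.s.) for bounds like (5): For some predictive density $\psi$, the $\psi$-predictions converge to the $\mu$-predictions i.m.H.s. on a sequence of inputs $x_{<\infty} \in \mathcal{X}^\infty$, if there is $R > 0$ such that $H^2_{x_{<\infty}}(\mu, \psi) \le R$, where 

$$H^2_{x_{<\infty}}(\mu, \psi) = \sum_{t=1}^{\infty} \mathbf{E}\, h_t^2 \quad\text{with}\quad h_t^2 = \int \Big( \sqrt{\mu(y_t|x_{1:t}, y_{<t})} - \sqrt{\psi(y_t|x_{1:t}, y_{<t})} \Big)^2 dy_t. \tag{6}$$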

Convergence i.m.H.s. is a very strong convergence criterion. It asserts a finite expected cumulative Hellinger loss in the first place. If the co-domain $\mathcal{Y}$ is finite as for classification (see Section 3), then convergence i.m.H.s. implies almost sure (a.s.) convergence of the (finitely many) posterior probabilities. For regression, the situation is more complex, since the posterior probabilities are densities, i.e. Banach space valued. Here, convergence i.m.H.s. implies that with $\mu$-probability one the square roots of the predictive densities converge to the square roots of the $\mu$-densities in $L^2(\mathbb{R})$ (endowed with the Lebesgue measure). In other words, $h_t^2$ converges to zero a.s.: 

$$P\big(\exists t \ge n : h_t^2 \ge \varepsilon\big) \le \sum_{t \ge n} P\big(h_t^2 \ge \varepsilon\big) \le \frac{1}{\varepsilon} \sum_{t \ge n} \mathbf{E}\, h_t^2 \xrightarrow{n \to \infty} 0 \tag{7}$$

holds by the union bound, the Markov inequality for all $\varepsilon > 0$, and $H^2_{x_{<\infty}} < \infty$, respectively, where $P$ is the $\mu$-probability. If the densities are uniformly bounded, then also the differences of the densities (as opposed to the differences of the square roots) converge to zero: 

$$\psi(y_t|x_{1:t}, y_{<t}) - \mu(y_t|x_{1:t}, y_{<t}) \to 0 \quad\text{in } L^2(\mathbb{R}) \text{ a.s.}$$

Moreover, the finite bound on the cumulative Hellinger distances can be interpreted as a convergence rate. Compare the parallel concept "convergence in mean sum" [Hut03b, PH04a, Hut04]. 

MDL Predictions. In many cases, the Bayes mixture is not only intractable, but even hard to approximate. So a very common substitute is the (ideal) MDL$^1$ estimator, also known as maximum a posteriori (MAP) or maximum complexity penalized likelihood estimator. Given a model class $\mathcal{C}$ with weights $(w_\nu)$ and a data set $(x_{1:n}, y_{1:n})$, we define the two-part MDL estimator as 

$$\nu^* = \nu^*_{(x_{1:n}, y_{1:n})} = \arg\max_{\nu \in \mathcal{C}} \{w_\nu\, \nu(y_{1:n}|x_{1:n})\} \quad\text{and}\quad \varrho(y_{1:n}|x_{1:n}) = \max_{\nu \in \mathcal{C}} \{w_\nu\, \nu(y_{1:n}|x_{1:n})\} = w_{\nu^*}\, \nu^*(y_{1:n}|x_{1:n}). \tag{8}$$

Note that we define both the model $\nu^*$, which is the MDL estimator, and its weighted density $\varrho$. In our setup (1), the MDL estimator is well defined, since all maxima exist$^2$. Moreover, $\varrho(\cdot|x_{1:n})$ is a density, but its integral is less than 1 in general. We have $\varrho(\cdot|x_{1:n}) \ge w_\nu\, \nu(\cdot|x_{1:n})$, so like $\xi$, $\varrho$ dominates each $\nu \in \mathcal{C}$. Also, $\varrho(\cdot|x_{1:n}) \le \xi(\cdot|x_{1:n})$ is clear by definition. If we use $\nu^*$ for (sequential online) prediction, this is the static MDL prediction: 

$$\varrho^{\mathrm{static}}(y_n|x_{1:n}, y_{<n}) = \nu^*_{(x_{<n}, y_{<n})}(y_n|x_n). \tag{9}$$

1 There is some disagreement about the exact meaning of the term MDL. Sometimes a specific prior is associated with MDL, while we admit arbitrary priors. More importantly, when coding some data $x$, one can exploit the fact that once the model $\nu^*$ is specified, only data which lead to the maximizing element $\nu^*$ need to be considered. This allows for a shorter description than $-\log_2 \nu^*(x)$. Nevertheless, the construction principle is commonly termed MDL, compare for instance the "ideal MDL" in [VL00]. 



This is the common way of using MDL for prediction. Clearly, the static MDL predictor is a probability density on $\mathbb{R}$. Alternatively, we may compute the MDL estimator for each possible $y_n$ separately, arriving at the dynamic MDL predictor: 

$$\varrho(y_n|x_{1:n}, y_{<n}) = \frac{\varrho(y_{1:n}|x_{1:n})}{\varrho(y_{<n}|x_{<n})}. \tag{10}$$

We have $\varrho(y_n|x_{1:n}, y_{<n}) \le \nu^*_{(x_{1:n}, y_{1:n})}(y_n|x_n)$ for each $y_n$, which shows that under condition (1) the dynamic MDL predictor is uniformly bounded. On the other hand, $\varrho(y_n|x_{1:n}, y_{<n}) \ge \nu^*_{(x_{<n}, y_{<n})}(y_n|x_n)$ holds, so the dynamic MDL predictor may be a density with mass more than 1. Hence we must usually normalize it for predicting: 

$$\bar\varrho(y_n|x_{1:n}, y_{<n}) = \frac{\varrho(y_{1:n}|x_{1:n})}{\int \varrho(y_{1:n}|x_{1:n})\, dy_n}. \tag{11}$$

Both fractions in (10) and (11) are well-defined except for a set of measure zero. Dynamic MDL predictions are in a sense computationally (almost) as expensive as the full Bayes mixture. 
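
The three MDL predictors (9), (10), and (11) can be compared on a small finite class. The sketch below is only an illustration under stated assumptions: the models, weights, and data are invented, a model is a hypothetical tuple `(f, sigma, w)`, and the integral over $y_n$ in (11) is approximated by a Riemann sum on a grid.

```python
import math

def gaussian_density(y, mean, sigma):
    """Gaussian density with standard deviation sigma, evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def joint_weighted(model, xs, ys):
    """w_nu * prod_t nu(y_t|x_t) for one model (f, sigma, w)."""
    f, sigma, w = model
    value = w
    for x, y in zip(xs, ys):
        value *= gaussian_density(y, f(x), sigma)
    return value

def rho(models, xs, ys):
    """Weighted MDL density (8): maximum of w_nu * nu(y_{1:n}|x_{1:n})."""
    return max(joint_weighted(m, xs, ys) for m in models)

def static_mdl(models, xs_past, ys_past, x_new, y_new):
    """Static prediction (9): density of the model maximising (8) on the past."""
    f, sigma, _ = max(models, key=lambda m: joint_weighted(m, xs_past, ys_past))
    return gaussian_density(y_new, f(x_new), sigma)

def dynamic_mdl(models, xs_past, ys_past, x_new, y_new):
    """Dynamic prediction (10): rho(y_{1:n}|x_{1:n}) / rho(y_{<n}|x_{<n})."""
    return rho(models, xs_past + [x_new], ys_past + [y_new]) / rho(models, xs_past, ys_past)

def dynamic_mass(models, xs_past, ys_past, x_new, grid):
    """Riemann-sum approximation of the integral of (10) over y_n."""
    step = grid[1] - grid[0]
    return step * sum(dynamic_mdl(models, xs_past, ys_past, x_new, y) for y in grid)

def normalised_dynamic_mdl(models, xs_past, ys_past, x_new, y_new, grid):
    """Normalised dynamic prediction (11)."""
    mass = dynamic_mass(models, xs_past, ys_past, x_new, grid)
    return dynamic_mdl(models, xs_past, ys_past, x_new, y_new) / mass

# Hypothetical class and data, chosen only for illustration.
models = [
    (lambda x: 0.0 * x, 1.0, 0.5),
    (lambda x: 1.0 * x, 1.0, 0.25),
    (lambda x: -1.0 * x, 1.0, 0.25),
]
xs_past, ys_past, x_new = [1.0], [0.4], 2.0
grid = [-12.0 + 0.01 * k for k in range(2401)]
mass = dynamic_mass(models, xs_past, ys_past, x_new, grid)
```

In this example `mass` exceeds 1, which illustrates why (10) has to be normalised before it is used as a predictive density. The dynamic predictor needs one maximisation per candidate value of $y_n$, which reflects the remark above that it is almost as expensive as the full mixture.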

Convergence Results. Our principal aim is to prove predictive properties of static MDL, since this is the practically most relevant variant. To this end, we first need to establish corresponding results for the dynamic MDL. Precisely, the following holds. 

Theorem 3 Assume the setup (1). If $\mu \in \mathcal{C}$, where $\mu$ is the true distribution, and $H^2_{x_{<\infty}}(\cdot, \cdot)$ is defined as in (6), then for all input sequences $x_{<\infty} \in \mathcal{X}^\infty$ 

(i) $H^2_{x_{<\infty}}(\mu, \bar\varrho) \le w_\mu^{-1} + \ln w_\mu^{-1}$, 
(ii) $H^2_{x_{<\infty}}(\bar\varrho, \varrho) \le 2 w_\mu^{-1}$, and 
(iii) $H^2_{x_{<\infty}}(\varrho, \varrho^{\mathrm{static}}) \le 3 w_\mu^{-1}$. 


2 For a model class with Gaussian noise $\mathcal{C} \in \mathcal{F}^{\mathrm{Gauss}}$ (2), we may dispose of the uniform boundedness condition and admit e.g. also $\mathcal{C}^{\mathrm{lin}}_{\ge 0}$. In order to compute the MDL estimator, we must then first check if there is nonzero mass concentrated on $(x_{1:n}, y_{1:n})$, in which case the mass is even one and the corresponding model with the largest weight is chosen. Otherwise, the MDL estimator is chosen according to the maximum penalized density. All results and proofs below generalize to this case. 



Since the triangle inequality holds for $\sqrt{H^2_{x_{<\infty}}(\cdot, \cdot)}$, we immediately conclude: 



Corollary 4 Given the setup (1) and $\mu \in \mathcal{C}$, then all three predictors $\bar\varrho$, $\varrho$, and $\varrho^{\mathrm{static}}$ converge to the true density $\mu$ in mean Hellinger sum, for any input sequence $x_{<\infty}$. In particular, we have $H^2_{x_{<\infty}}(\mu, \varrho^{\mathrm{static}}) \le 21 w_\mu^{-1}$. 
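
The constant 21 is not derived explicitly in the text. One plausible reading, sketched below as an assumption, chains the three bounds of Theorem 3 with the triangle inequality for $\sqrt{H^2}$ and uses $\ln w_\mu^{-1} \le w_\mu^{-1}$ to bound (i) by $2 w_\mu^{-1}$; the function names are hypothetical.

```python
import math

def static_bound_constant():
    """(sqrt(2) + sqrt(2) + sqrt(3))^2, the factor multiplying 1/w_mu after
    bounding Theorem 3 (i) by 2/w_mu and chaining (i), (ii), (iii)."""
    return (math.sqrt(2) + math.sqrt(2) + math.sqrt(3)) ** 2

def chained_bound(w_mu):
    """Triangle-inequality bound on H^2(mu, rho_static) without simplifying (i)."""
    inv = 1.0 / w_mu
    parts = [inv + math.log(inv), 2.0 * inv, 3.0 * inv]
    return sum(math.sqrt(p) for p in parts) ** 2

constant = static_bound_constant()
```

Under this reading the factor is about 20.8, which rounds up to the stated 21; `chained_bound(w_mu)` is never larger than `21 / w_mu` for prior weights in (0, 1].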

We will only prove (i) of Theorem 3 here. The proofs of (ii) and (iii) can be similarly adapted from [PH04a, Theorems 10 and 11], since the Hellinger distance is bounded by the absolute distance: $\int \big(\sqrt{\mu(y)} - \sqrt{\nu(y)}\big)^2 dy \le \int |\mu(y) - \nu(y)|\, dy$ follows from $(\sqrt{a} - \sqrt{b})^2 \le |a - b|$ for any $a, b \ge 0$ (this shows also that the integral $h_t^2$ in (6) exists). In order to show (i), we make use of the fact that the squared Hellinger distance is bounded by the Kullback-Leibler divergence: 

$$\int \Big(\sqrt{\mu(y)} - \sqrt{\nu(y)}\Big)^2 dy \le \int \mu(y) \ln \frac{\mu(y)}{\nu(y)}\, dy \tag{12}$$

for any two probability densities $\mu$ and $\nu$ on $\mathbb{R}$ (see e.g. [BM98, p. 178]). So we only need to establish the corresponding bound for the Kullback-Leibler divergence and show 

$$D_{x_{1:n}}(\mu \| \bar\varrho) = \sum_{t=1}^{n} \mathbf{E} \int \mu(y_t|x_{1:t}, y_{<t}) \ln \frac{\mu(y_t|x_{1:t}, y_{<t})}{\bar\varrho(y_t|x_{1:t}, y_{<t})}\, dy_t \le w_\mu^{-1} + \ln w_\mu^{-1} \tag{13}$$



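for all $n \ge 1$. In the following computation, we take $x_{<\infty}$ to be fixed and suppress it in the notation, writing e.g. $\mu(y_t|y_{<t})$ instead of $\mu(y_t|x_{1:t}, y_{<t})$. Then 

$$D(\mu \| \bar\varrho) = \sum_t \mathbf{E} \ln \frac{\mu(y_t|y_{<t})}{\bar\varrho(y_t|y_{<t})} = \sum_t \mathbf{E} \left[ \ln \frac{\mu(y_t|y_{<t})}{\varrho(y_t|y_{<t})} + \ln \frac{\int \varrho(y_{1:t})\, dy_t}{\varrho(y_{<t})} \right]. \tag{14}$$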

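The first part of the last term is bounded by 

$$\sum_{t=1}^{n} \mathbf{E} \ln \frac{\mu(y_t|y_{<t})}{\varrho(y_t|y_{<t})} = \mathbf{E} \ln \prod_{t=1}^{n} \frac{\mu(y_t|y_{<t})}{\varrho(y_t|y_{<t})} = \mathbf{E} \ln \frac{\mu(y_{1:n}|x_{1:n})}{\varrho(y_{1:n}|x_{1:n})} \le \ln w_\mu^{-1}, \tag{15}$$

since always $\mu / \varrho \le w_\mu^{-1}$. For the second part, use $\ln u \le u - 1$ to obtain 

$$\mathbf{E} \ln \frac{\int \varrho(y_{1:t})\, dy_t}{\varrho(y_{<t})} \le \mathbf{E} \left[ \frac{\int \varrho(y_{1:t})\, dy_t}{\varrho(y_{<t})} - 1 \right] = \int \frac{\mu(y_{<t}) \big( \int \varrho(y_{1:t})\, dy_t - \varrho(y_{<t}) \big)}{\varrho(y_{<t})}\, dy_{<t} \le w_\mu^{-1} \left[ \int \varrho(y_{1:t})\, dy_{1:t} - \int \varrho(y_{<t})\, dy_{<t} \right].$$

If this is summed over $t = 1 \ldots n$, the last term is telescoping. So using $\varrho(\emptyset) = \max_\nu w_\nu > 0$ and $\varrho \le \xi$, we conclude 

$$\sum_{t=1}^{n} \mathbf{E} \ln \frac{\int \varrho(y_{1:t})\, dy_t}{\varrho(y_{<t})} \le w_\mu^{-1} \left[ \int \varrho(y_{1:n})\, dy_{1:n} - \varrho(\emptyset) \right] < w_\mu^{-1} \int \xi(y_{1:n})\, dy_{1:n} = w_\mu^{-1}. \tag{16}$$

Hence, (14), (15), and (16) together show (13). $\Box$ 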

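We may for example apply the result for the static predictions in a Gaussian noise class $\mathcal{C} \in \mathcal{F}^{\mathrm{Gauss}}_{\ge\sigma_0}$. 

Corollary 5 Let $\mathcal{C} \in \mathcal{F}^{\mathrm{Gauss}}_{\ge\sigma_0}$ (see (2)), then the mean and the variance of the static MDL predictions converge to their true values almost surely. The same holds for $\mathcal{C} \in \mathcal{F}^{\mathrm{Gauss}}$. In particular, if the variance of all models in $\mathcal{C}$ is the same value $\sigma^2$, then 

$$\sum_t \mathbf{E}\, 2\left[ 1 - \exp\left( -\frac{\big(g^*(x_t|\ldots) - f(x_t)\big)^2}{8\sigma^2} \right) \right] \le 21 w_\mu^{-1}, \quad\text{where}$$

$f(x_t)$ is the mean value of the true distribution and 

$$g^* = \arg\min_{f_i} \Big\{ \sum_{t=1}^{n-1} \big(y_t - f_i(x_t)\big)^2 + 2\sigma^2 \ln w_i^{-1} \Big\}$$

is the mean of the MDL predictor. 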

For $\mathcal{C} \in \mathcal{F}^{\mathrm{Gauss}}_{\ge\sigma_0}$, almost sure convergence holds since otherwise the cumulative Hellinger distances would be infinite, see (7). This generalizes to $\mathcal{C} \in \mathcal{F}^{\mathrm{Gauss}}$, compare footnote 2. In the case of constant variance, the cumulative Hellinger distances can be explicitly stated as above. Note that since 

$$1 - \exp\left( -\frac{\big(g^*(x_t|\ldots) - f(x_t)\big)^2}{8\sigma^2} \right) \approx \frac{\big(g^*(x_t|\ldots) - f(x_t)\big)^2}{8\sigma^2}$$

for small $\big(g^*(x_t|\ldots) - f(x_t)\big)^2$, this implies convergence of $g^*$ to $f$ faster than $O(1/\sqrt{t})$ if the convergence is monotone. Moreover, deviations of a fixed magnitude can only occur finitely often. 
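
The closed form used in Corollary 5 is the standard squared Hellinger distance between two Gaussians with equal variance. As a sanity check, the sketch below compares it with a direct numerical integration of $\int (\sqrt{p} - \sqrt{q})^2 dy$; the integration range, step count, and function names are arbitrary illustrative choices.

```python
import math

def gaussian_density(y, mean, sigma):
    """Gaussian density with standard deviation sigma, evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def hellinger_sq_closed_form(mean_gap, sigma):
    """2 * [1 - exp(-gap^2 / (8 sigma^2))] for two Gaussians with equal variance."""
    return 2.0 * (1.0 - math.exp(-mean_gap ** 2 / (8.0 * sigma ** 2)))

def hellinger_sq_numeric(mean_a, mean_b, sigma, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of the integral of (sqrt(p) - sqrt(q))^2."""
    step = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * step
        diff = (math.sqrt(gaussian_density(y, mean_a, sigma))
                - math.sqrt(gaussian_density(y, mean_b, sigma)))
        total += diff * diff
    return total * step

closed = hellinger_sq_closed_form(1.0, 1.0)
numeric = hellinger_sq_numeric(0.0, 1.0, 1.0)
```

For a small gap $d$ between the means the closed form behaves like $d^2 / (4\sigma^2)$, which is the quadratic approximation used above to translate a finite cumulative Hellinger loss into a convergence rate for $g^*$.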

Compared with the bound for the Bayes mixture in Theorem 2, MDL bounds are exponentially larger. The bounds are sharp, as shown in [PH04a, Example 9]; this example may also be adapted to the regression framework. 



3. Classification 

The classification setup is technically easier, since only a finite co-domain $\mathcal{Y}$ has to be considered. Results corresponding to Theorem 3 and Corollary 4 follow analogously. Alternatively, one may conditionalize the results for sequence prediction in [PH04a] with respect to the input sequence $x_{<\infty}$, arriving equally at the assertions for classification. The results in [PH04a] are formulated in terms of mean (square) sum convergence instead of Hellinger sum convergence. On finite co-domain, these two convergence notions induce the same topology. 

Theorem 6 Let $\mathcal{X}$ be arbitrary and $\mathcal{Y}$ be a finite set of class labels. $\mathcal{C} = \{\nu_i : i \ge 1\}$ consists of classification models, i.e. for each $\nu \in \mathcal{C}$, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ we have $\nu(y|x) \ge 0$ and $\sum_y \nu(y|x) = 1$. Each model $\nu$ is associated with a prior weight $w_\nu > 0$, and $\sum_\nu w_\nu = 1$ holds. Let the MDL predictions be defined analogously to (8)-(11) (the difference being that here probabilities are maximized instead of densities). Assume that $\mu \in \mathcal{C}$, where $\mu$ is the true distribution. Then for each $x_{<\infty} \in \mathcal{X}^\infty$, 

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{y \in \mathcal{Y}} \Big( \sqrt{\mu(y|x_t)} - \sqrt{\varrho^{\mathrm{static}}(y|x_{1:t}, y_{<t})} \Big)^2 \le 21 w_\mu^{-1} \quad\text{and}$$

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{y \in \mathcal{Y}} \Big( \mu(y|x_t) - \varrho^{\mathrm{static}}(y|x_{1:t}, y_{<t}) \Big)^2 \le 21 w_\mu^{-1}$$

holds. Similar assertions are satisfied for the normalized and the un-normalized dynamic MDL predictor. In particular, the predictive probabilities of all three MDL predictors converge to the true probabilities almost surely. 

The second bound on the quadratic differences is shown in [PH04a]. The assertions about almost sure convergence follow as in (7). 

4. Discussion and Conclusions 

We have seen that discrete MDL has good asymp- 
totic predictive properties. On the other hand, the 
loss bounds for MDL are exponential compared to the 
Bayes mixture loss bound. This is no proof artifact, 
as examples are easily constructed where the bound is 
sharp [PH04a]. 

This has an important implication for the practical use of MDL: One needs to choose the underlying model class and the prior carefully. Then it can be expected that the predictions are good and converge fast: this is supported by theoretical arguments in [Ris96, PH04b]. The Bayes mixture in contrast, which can be viewed as a very large (infinite) weighted committee, also converges rapidly with unfavorable model classes, but at higher computational expense. 

One might be interested in other loss functions than the Hellinger loss. For the classification case, a bound on the expected error loss (number of classification errors) of MDL may be derived with the techniques from [Hut04], using the bound on the quadratic distance. [Hut03a] also gives bounds for arbitrary loss functions; however, this requires a bound on the Kullback-Leibler divergence rather than the quadratic distance. Unfortunately, this does not hold for static MDL [PH04a]. For the regression setup, the analysis of other, more general or even arbitrary loss functions is even more demanding and, as far as we know, open. 

Considering only discrete model classes is certainly a restriction, since many models arising in science (e.g. physics or biology) are continuous. On the other hand, there are arguments in favor of discrete classes. From a computational point of view they are definitely sufficient. Real computers may even treat only finite model classes. The class of all programs on a fixed universal Turing machine is countable. It may be related to discrete classes of stochastic models by means of semimeasures; this is one of the central issues in Algorithmic Information Theory [LV97]. 